Customer Churn Analysis¶

The Story Behind the Data¶

A bank is concerned that more and more customers are leaving its credit card services. It would like an analysis of the data to understand the main reasons customers leave, together with recommendations for how the bank can mitigate the attrition. Eventually, the bank would like to implement these recommendations proactively in order to keep its customers happy.

A full ERD can be found here

Data Description¶

In this task, a few datasets are provided:

  1. BankChurners.csv - this file contains basic information about each client (10 columns). The columns are:

    • CLIENTNUM - Client number. Unique identifier for the customer holding the account
    • Attrition_Flag - Internal event (customer activity) variable - whether the client has churned (attrited) or not (existing)
    • Dependent_Count - Demographic variable - Number of dependents
    • Card_Category - Product Variable - Type of Card (Blue, Silver, Gold, Platinum)
    • Months_on_book - Period of relationship with bank
    • Months_Inactive_12_mon - No. of months inactive in the last 12 months
    • Contacts_Count_12_mon - No. of Contacts in the last 12 months
    • Credit_Limit - Credit Limit on the Credit Card
    • Avg_Open_To_Buy - Open to Buy Credit Line (Average of last 12 months)
    • Avg_Utilization_Ratio - Average Card Utilization Ratio
  2. basic_client_info.csv - this file contains basic info for each client (6 columns). The columns are:

    • CLIENTNUM - Client number. Unique identifier for the customer holding the account
    • Customer_Age - Demographic variable - Customer's age in years
    • Gender - Demographic variable - M=Male, F=Female
    • Education_Level - Demographic variable - Educational qualification of the account holder (example: high school, college graduate, etc.)
    • Marital_Status - Demographic variable - Married, Single, Divorced, Unknown
    • Income_Category - Demographic variable - Annual income category of the account holder (< $40K, $40K - $60K, $60K - $80K, $80K - $120K, > $120K, Unknown)

  3. enriched_churn_data.csv - this file contains some enriched data about each client (7 columns). The columns are:
    • CLIENTNUM - Client number. Unique identifier for the customer holding the account
    • Total_Relationship_Count - Total no. of products held by the customer
    • Total_Revolving_Bal - Total Revolving Balance on the Credit Card
    • Total_Amt_Chng_Q4_Q1 - Change in Transaction Amount (Q4 over Q1)
    • Total_Trans_Amt - Total Transaction Amount (Last 12 months)
    • Total_Trans_Ct - Total Transaction Count (Last 12 months)
    • Total_Ct_Chng_Q4_Q1 - Change in Transaction Count (Q4 over Q1)
In [1]:
# import necessary libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
import plotly.express as px
import plotly.graph_objs as go
import plotly.io as pio
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import cross_val_score, train_test_split

pio.renderers.default = 'notebook'

Data wrangling¶

In [2]:
# import from csv
data0 = pd.read_csv('bankchurners.csv')
data1 = pd.read_csv('enriched_churn_data.csv')
data2 = pd.read_csv('basic_client_info.csv')
In [3]:
# merge all tables on the client identifier
data01 = pd.merge(data0, data1, on='clientnum')
df = pd.merge(data01, data2, on='clientnum')
In [4]:
pd.set_option('display.max_colwidth', 10)
pd.set_option('display.max_columns', 21)
df.sample(10)
Out[4]:
clientnum attrition_flag dependent_count card_category months_on_book months_inactive_12_mon contacts_count_12_mon credit_limit avg_open_to_buy avg_utilization_ratio total_relationship_count total_revolving_bal total_amt_chng_q4_q1 total_trans_amt total_ct_chng_q4_q1 total_trans_ct customer_age gender education_level marital_status income_category
9471 768655158 Existi... 0 Blue 23 3 2 6224.0 5132.0 0.175 3 1092 0.919 2222 0.560 39 29 M Unknown Married $60K -...
10098 713110908 Existi... 1 Blue 13 6 5 2607.0 641.0 0.754 3 1966 0.650 1719 0.607 45 30 M Unknown Married Less t...
8134 711766758 Existi... 4 Blue 30 2 1 1531.0 544.0 0.645 3 987 0.831 4373 0.811 67 42 F Unknown Married Less t...
92 709597983 Existi... 0 Blue 56 1 3 1573.0 920.0 0.415 6 653 0.710 3349 0.674 77 65 M Doctorate Divorced Less t...
314 819943458 Attrit... 3 Blue 52 4 3 34516.0 33516.0 0.029 6 1000 0.694 1777 0.542 37 57 M Graduate Single $120K +
2670 709902108 Existi... 2 Blue 40 3 2 8169.0 7375.0 0.097 4 794 0.665 4927 0.944 70 51 M High S... Married $80K -...
923 708525108 Existi... 2 Blue 47 3 3 2156.0 1022.0 0.526 4 1134 0.669 4120 0.711 77 54 F Graduate Married Unknown
1093 715025208 Existi... 2 Blue 46 3 3 3119.0 1341.0 0.570 5 1778 1.010 4168 0.853 63 56 F College Married Less t...
653 767348733 Existi... 2 Blue 49 3 3 4058.0 3265.0 0.195 3 793 0.758 15865 0.667 105 56 M Unknown Married $60K -...
9187 710577708 Existi... 1 Blue 25 3 3 2657.0 674.0 0.746 5 1983 0.831 2591 0.903 59 35 M Unknown Single $40K -...
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   clientnum                 10127 non-null  int64  
 1   attrition_flag            10127 non-null  object 
 2   dependent_count           10127 non-null  int64  
 3   card_category             10127 non-null  object 
 4   months_on_book            10127 non-null  int64  
 5   months_inactive_12_mon    10127 non-null  int64  
 6   contacts_count_12_mon     10127 non-null  int64  
 7   credit_limit              10127 non-null  float64
 8   avg_open_to_buy           10127 non-null  float64
 9   avg_utilization_ratio     10127 non-null  float64
 10  total_relationship_count  10127 non-null  int64  
 11  total_revolving_bal       10127 non-null  int64  
 12  total_amt_chng_q4_q1      10127 non-null  float64
 13  total_trans_amt           10127 non-null  int64  
 14  total_ct_chng_q4_q1       10127 non-null  float64
 15  total_trans_ct            10127 non-null  int64  
 16  customer_age              10127 non-null  int64  
 17  gender                    10127 non-null  object 
 18  education_level           10127 non-null  object 
 19  marital_status            10127 non-null  object 
 20  income_category           10127 non-null  object 
dtypes: float64(5), int64(10), object(6)
memory usage: 1.7+ MB

Data Cleaning¶

In [8]:
#make a copy
df_clean = df.copy()
In [9]:
#check duplicates
df_clean.duplicated().sum()
Out[9]:
0
In [10]:
#check nulls
df_clean.attrition_flag.isna().sum()
Out[10]:
0

Pre-processing¶

In [12]:
cat_cols = [c for c in df_clean.columns if df_clean[c].dtype == 'object']
cat_cols               
Out[12]:
['attrition_flag',
 'card_category',
 'gender',
 'education_level',
 'marital_status',
 'income_category']
In [13]:
num_cols = [n for n in df_clean.columns if df_clean[n].dtype != 'object']
num_cols  
Out[13]:
['clientnum',
 'dependent_count',
 'months_on_book',
 'months_inactive_12_mon',
 'contacts_count_12_mon',
 'credit_limit',
 'avg_open_to_buy',
 'avg_utilization_ratio',
 'total_relationship_count',
 'total_revolving_bal',
 'total_amt_chng_q4_q1',
 'total_trans_amt',
 'total_ct_chng_q4_q1',
 'total_trans_ct',
 'customer_age']
In [14]:
df_clean[num_cols].describe()
Out[14]:
clientnum dependent_count months_on_book months_inactive_12_mon contacts_count_12_mon credit_limit avg_open_to_buy avg_utilization_ratio total_relationship_count total_revolving_bal total_amt_chng_q4_q1 total_trans_amt total_ct_chng_q4_q1 total_trans_ct customer_age
count 1.0127... 10127.... 10127.... 10127.... 10127.... 10127.... 10127.... 10127.... 10127.... 10127.... 10127.... 10127.... 10127.... 10127.... 10127....
mean 7.3917... 2.346203 35.928409 2.341167 2.455317 8631.9... 7469.1... 0.274894 3.812580 1162.8... 0.759941 4404.0... 0.712222 64.858695 46.325960
std 3.6903... 1.298908 7.986416 1.010622 1.106225 9088.7... 9090.6... 0.275691 1.554408 814.98... 0.219207 3397.1... 0.238086 23.472570 8.016814
min 7.0808... 0.000000 13.000000 0.000000 0.000000 1438.3... 3.000000 0.000000 1.000000 0.000000 0.000000 510.00... 0.000000 10.000000 26.000000
25% 7.1303... 1.000000 31.000000 2.000000 2.000000 2555.0... 1324.5... 0.023000 3.000000 359.00... 0.631000 2155.5... 0.582000 45.000000 41.000000
50% 7.1792... 2.000000 36.000000 2.000000 2.000000 4549.0... 3474.0... 0.176000 4.000000 1276.0... 0.736000 3899.0... 0.702000 67.000000 46.000000
75% 7.7314... 3.000000 40.000000 3.000000 3.000000 11067.... 9859.0... 0.503000 5.000000 1784.0... 0.859000 4741.0... 0.818000 81.000000 52.000000
max 8.2834... 5.000000 56.000000 6.000000 6.000000 34516.... 34516.... 0.999000 6.000000 2517.0... 3.397000 18484.... 3.714000 139.00... 73.000000
In [15]:
df_clean.attrition_flag.value_counts()  # is this below the annual churn rate?
Out[15]:
Existing Customer    8500
Attrited Customer    1627
Name: attrition_flag, dtype: int64
In [16]:
# Define a dictionary mapping the categories to binary values
dict_ch = {
    'Existing Customer': 0,
    'Attrited Customer': 1,
}

# Use the map() method to replace the values in the 'attrition_flag' column
df_clean['churned'] = df_clean['attrition_flag'].map(dict_ch)
In [17]:
# Drop the original 'attrition_flag' column
df_clean.drop('attrition_flag', axis=1, inplace=True)
In [18]:
df_clean
Out[18]:
clientnum dependent_count card_category months_on_book months_inactive_12_mon contacts_count_12_mon credit_limit avg_open_to_buy avg_utilization_ratio total_relationship_count total_revolving_bal total_amt_chng_q4_q1 total_trans_amt total_ct_chng_q4_q1 total_trans_ct customer_age gender education_level marital_status income_category churned
0 806160108 1 Blue 56 2 3 3193.0 676.0 0.788 2 2517 1.831 1336 1.143 30 61 M High S... Married $40K -... 0
1 804424383 1 Blue 56 3 2 10215.0 9205.0 0.099 3 1010 0.843 1904 1.000 40 63 M Unknown Married $60K -... 0
2 708300483 0 Blue 56 4 3 7882.0 7277.0 0.077 5 605 1.052 704 0.143 16 66 F Doctorate Married Unknown 1
3 808284783 1 Blue 56 0 0 1438.3 1438.3 0.000 6 0 0.813 1951 1.095 44 62 F Unknown Married Less t... 0
4 712720158 1 Blue 56 2 3 13860.0 12208.0 0.119 5 1652 1.255 1910 1.909 32 68 M Graduate Married Unknown 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10122 714082308 3 Blue 13 1 3 8656.0 8656.0 0.000 2 0 0.893 8056 0.935 89 39 M High S... Single $80K -... 1
10123 710751033 1 Blue 13 1 3 4829.0 4829.0 0.000 1 0 0.836 8286 0.857 65 31 F Graduate Single Less t... 1
10124 780118158 0 Blue 13 2 4 3585.0 1165.0 0.675 2 2420 0.812 15656 0.746 124 27 M Graduate Single Less t... 0
10125 779082633 2 Blue 13 2 3 4107.0 3128.0 0.238 3 979 0.647 14596 0.733 104 30 M Uneduc... Single $60K -... 0
10126 788633733 3 Silver 13 3 2 28174.0 27001.0 0.042 4 1173 0.779 16258 0.714 108 36 M Graduate Married $60K -... 0

10127 rows × 21 columns

In [19]:
# Calculate the percentage of churned customers
num_churned = df_clean[df_clean["churned"] == 1].shape[0]
num_total = df_clean.shape[0]
percent_churned = (num_churned / num_total) * 100

# Format the result as a percentage
print("Percentage of churned customers: {:.2f}%".format(percent_churned))
Percentage of churned customers: 16.07%

Observations¶

The annual churn rate for credit cards can vary greatly depending on the type of card and the target demographic. On average, it is common to see an annual churn rate of around 20% for credit cards. However, some high-end, premium credit cards may have a lower churn rate, while cards targeted towards subprime borrowers may have a higher churn rate.

A lower churn rate is generally a positive sign for a card issuer: it indicates that customers are satisfied with the card and unlikely to switch. A higher churn rate suggests that customers are dissatisfied or are moving to cards with more attractive terms.

A preliminary overview of the dataset shows that around 83.93% of customers are active, while 16.07% have churned.
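The same split can be read off in one line with `value_counts(normalize=True)`; a minimal sketch on an illustrative stand-in series (the notebook would use `df_clean['churned']`):

```python
import pandas as pd

# Illustrative stand-in for df_clean['churned'] (1 = attrited, 0 = existing).
churned = pd.Series([0, 0, 0, 0, 1, 0, 0, 0, 0, 0])

# value_counts(normalize=True) returns both shares at once,
# avoiding manual row counting.
rates = churned.value_counts(normalize=True)
print('Churn rate: {:.2%}'.format(rates[1]))
```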

In [20]:
# Expand gender abbreviations to full labels
dict_g = {
    'F': 'Female',
    'M': 'Male',
}
df_clean['gender'] = df_clean['gender'].map(dict_g)
In [21]:
df.describe(include='object')
Out[21]:
attrition_flag card_category gender education_level marital_status income_category
count 10127 10127 10127 10127 10127 10127
unique 2 4 2 7 4 6
top Existi... Blue F Graduate Married Less t...
freq 8500 9436 5358 3128 4687 3561
In [22]:
df_clean.describe().loc[['min', 'max', 'mean']].round(2)
Out[22]:
clientnum dependent_count months_on_book months_inactive_12_mon contacts_count_12_mon credit_limit avg_open_to_buy avg_utilization_ratio total_relationship_count total_revolving_bal total_amt_chng_q4_q1 total_trans_amt total_ct_chng_q4_q1 total_trans_ct customer_age churned
min 7.0808... 0.00 13.00 0.00 0.00 1438.30 3.00 0.00 1.00 0.00 0.00 510.00 0.00 10.00 26.00 0.00
max 8.2834... 5.00 56.00 6.00 6.00 34516.00 34516.00 1.00 6.00 2517.00 3.40 18484.00 3.71 139.00 73.00 1.00
mean 7.3917... 2.35 35.93 2.34 2.46 8631.95 7469.14 0.27 3.81 1162.81 0.76 4404.09 0.71 64.86 46.33 0.16
In [23]:
df_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   clientnum                 10127 non-null  int64  
 1   dependent_count           10127 non-null  int64  
 2   card_category             10127 non-null  object 
 3   months_on_book            10127 non-null  int64  
 4   months_inactive_12_mon    10127 non-null  int64  
 5   contacts_count_12_mon     10127 non-null  int64  
 6   credit_limit              10127 non-null  float64
 7   avg_open_to_buy           10127 non-null  float64
 8   avg_utilization_ratio     10127 non-null  float64
 9   total_relationship_count  10127 non-null  int64  
 10  total_revolving_bal       10127 non-null  int64  
 11  total_amt_chng_q4_q1      10127 non-null  float64
 12  total_trans_amt           10127 non-null  int64  
 13  total_ct_chng_q4_q1       10127 non-null  float64
 14  total_trans_ct            10127 non-null  int64  
 15  customer_age              10127 non-null  int64  
 16  gender                    10127 non-null  object 
 17  education_level           10127 non-null  object 
 18  marital_status            10127 non-null  object 
 19  income_category           10127 non-null  object 
 20  churned                   10127 non-null  int64  
dtypes: float64(5), int64(11), object(5)
memory usage: 1.7+ MB
In [24]:
corr_matrix = df_clean.corr(numeric_only=True)

fig = px.imshow(corr_matrix,
                color_continuous_scale='YlOrRd',
                aspect="auto",
                title='Correlation Matrix')
fig.update_traces(
    text=corr_matrix.round(2),
    texttemplate="%{text:.2f}",
    textfont_size=12,
    hovertemplate=
    "Feature 1: %{y}<br>Feature 2: %{x}<br>Correlation: %{text:.2f}<extra></extra>",
    customdata=np.moveaxis(corr_matrix.values, 0, -1))
fig.update_xaxes(tickangle=90, tickfont=dict(size=12), title="")
fig.update_yaxes(tickfont=dict(size=12), title="")
fig.update_coloraxes(showscale=True, colorbar=dict(len=0.4, y=0.75))
fig.show()
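The heatmap shows all pairwise correlations; to rank features by their relationship with churn directly, the target column of the correlation matrix can be sorted. A minimal sketch on synthetic stand-in data (column names assumed from the notebook):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_clean; column names are borrowed from the notebook.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'total_trans_ct': rng.integers(10, 140, 200),
    'total_revolving_bal': rng.integers(0, 2517, 200),
    'churned': rng.integers(0, 2, 200),
})

# Sort the target column of the correlation matrix to rank features
# by the absolute strength of their (linear) relationship with churn.
corr_with_target = (df.corr(numeric_only=True)['churned']
                    .drop('churned')
                    .abs()
                    .sort_values(ascending=False))
print(corr_with_target)
```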
In [25]:
# filter for churned
df_churned = df_clean[df_clean['churned'] == 1]

corr_matrix = df_churned.corr(numeric_only=True)

fig = px.imshow(corr_matrix,
                color_continuous_scale='YlOrRd',
                aspect="auto",
                title='Correlation Matrix for Churned')
fig.update_traces(
    text=corr_matrix.round(2),
    texttemplate="%{text:.2f}",
    textfont_size=12,
    hovertemplate=
    "Feature 1: %{y}<br>Feature 2: %{x}<br>Correlation: %{text:.2f}<extra></extra>",
    customdata=np.moveaxis(corr_matrix.values, 0, -1))
fig.update_xaxes(tickangle=90, tickfont=dict(size=12), title="")
fig.update_yaxes(tickfont=dict(size=12), title="")
fig.update_coloraxes(showscale=True, colorbar=dict(len=0.4, y=0.75))
fig.show()
In [26]:
X = df_clean.drop('churned', axis=1)
y = df_clean['churned']

# One-hot encode categorical variables
X_encoded = pd.get_dummies(X,
                           columns=[
                               'card_category', 'gender', 'education_level',
                               'marital_status', 'income_category'
                           ])

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

# Train the XGBoost model
xgb_model = xgb.XGBClassifier()
xgb_model.fit(X_train, y_train)

# Evaluate the model on the testing set
y_pred = xgb_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)

print('Accuracy:', accuracy)
print('Confusion matrix:', confusion_mat)

# Visualize the model's feature importances
importances = pd.Series(xgb_model.feature_importances_,
                        index=X_encoded.columns)
importances_nonzero = importances[importances != 0].sort_values(ascending=True)

# Create a horizontal bar chart of feature importances
fig = go.Figure(
    go.Bar(x=importances_nonzero.values,
           y=importances_nonzero.index,
           orientation='h'))

fig.update_layout(title='Feature Importances (XGBoost)',
                  xaxis_title='Importance score (gain)',
                  yaxis_title='Feature')

fig.show()
Accuracy: 0.9772951628825272
Confusion matrix: [[1701   19]
 [  27  279]]
In [27]:
# Retrain the XGBoost model with the most important features
important_features = importances_nonzero.tail(10).index.tolist()
X_important = X_encoded[important_features]
X_train_important, X_test_important, y_train, y_test = train_test_split(
    X_important, y, test_size=0.2, random_state=42)
xgb_model_important = xgb.XGBClassifier()
xgb_model_important.fit(X_train_important, y_train)

# Evaluate the model on the testing set
y_pred = xgb_model_important.predict(X_test_important)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1)
print('Confusion matrix:', confusion_mat)
Accuracy: 0.9763079960513327
Precision: 0.9358108108108109
Recall: 0.9052287581699346
F1 Score: 0.920265780730897
Confusion matrix: [[1701   19]
 [  29  277]]
In [28]:
# Create an XGBoost classifier with default hyperparameters
clf = xgb.XGBClassifier()

# Perform 5-fold cross-validation on the data
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='f1')

# Print the average F1 score across all folds
print('Average F1 score:', scores.mean())
Average F1 score: 0.8996300666012841
In [29]:
# initialize the logistic regression model
lr_model = LogisticRegression()

# train the model on the training data
lr_model.fit(X_train, y_train)

# make predictions on the testing data
y_pred = lr_model.predict(X_test)

# evaluate the model's performance
print(classification_report(y_test, y_pred, zero_division=1))
              precision    recall  f1-score   support

           0       0.85      1.00      0.92      1720
           1       1.00      0.00      0.00       306

    accuracy                           0.85      2026
   macro avg       0.92      0.50      0.46      2026
weighted avg       0.87      0.85      0.78      2026
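The report above shows the untuned logistic regression predicting essentially no churners (recall 0.00 for class 1), a typical symptom of class imbalance combined with unscaled features. A hedged sketch of one common remedy, demonstrated on synthetic imbalanced data rather than the bank features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (~16% positives) standing in for the encoded
# bank features; the notebook would reuse X_train/X_test instead.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.84],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

# Scaling helps the solver converge on wide-ranged features (e.g. credit_limit);
# class_weight='balanced' upweights the rare churned class.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(class_weight='balanced', max_iter=1000))
model.fit(X_tr, y_tr)
rec = recall_score(y_te, model.predict(X_te))
print('Recall on the minority class:', round(rec, 3))
```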

  • In XGBoost, the "F score" reported by xgb.plot_importance is the number of times a feature is used to split the data across all trees (importance_type='weight'). The feature_importances_ attribute used for the bar chart above defaults to 'gain' for tree boosters, i.e. the average reduction in loss contributed by splits on that feature. Under either measure, features with higher scores are the ones the model relies on most for prediction.
In [30]:
# Create the DMatrix
dmatrix = xgb.DMatrix(data=X_encoded, label=y)

# Create the parameter dictionary: params
params = {"objective": "binary:logistic", "max_depth": 2}

# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=dmatrix, num_boost_round=10)

# Plot the first tree
fig = plt.figure(figsize=(25, 20), dpi=300)
xgb.plot_tree(xg_reg, num_trees=0, ax=plt.gca(), fontsize=14)
plt.show()
In [31]:
# Plot the fifth tree
fig = plt.figure(figsize=(25, 20), dpi=300)
xgb.plot_tree(xg_reg, num_trees=4, ax=plt.gca(), fontsize=14)
plt.show()

Economics¶

In [32]:
# Plot distribution of total transaction count by attrition flag
plt.figure(figsize=(10, 6),dpi=150)
sns.kdeplot(data=df_clean,
            x='total_trans_ct',
            hue='churned',
            fill=True,
            alpha=0.5,
            palette='cool')
plt.title('Distribution of Total Transaction Count by Attrition Flag')
plt.xlabel('Total Transaction Count')
plt.ylabel('Density')
plt.show()
In [33]:
churned_trans_by_gender_age = df_clean[df_clean['churned'] == 1].groupby(
    ['gender'])['total_trans_ct'].describe()
pd.set_option('display.max_colwidth', 20)
pd.set_option('display.max_columns', 10)
churned_trans_by_gender_age
Out[33]:
count mean std min 25% 50% 75% max
gender
Male 697.0 46.110473 17.437841 10.0 35.0 44.0 57.0 91.0
Female 930.0 44.051613 11.909848 12.0 38.0 43.0 49.0 94.0
In [34]:
# Plot distribution of total revolving balance by attrition flag
plt.figure(figsize=(10, 6), dpi=150)
sns.kdeplot(data=df_clean,
            x='total_revolving_bal',
            hue='churned',
            fill=True,
            alpha=0.5,
            palette='cool')
plt.title('Distribution of Total Revolving Balance by Attrition Flag')
plt.xlabel('Total Revolving Balance')
plt.ylabel('Density')
plt.show()
In [35]:
churned_bal_by_gender_age = df_clean[df_clean['churned'] == 1].groupby(
    ['gender'])['total_revolving_bal'].describe()
pd.set_option('display.max_colwidth', 20)
pd.set_option('display.max_columns', 10)
churned_bal_by_gender_age
Out[35]:
count mean std min 25% 50% 75% max
gender
Male 697.0 680.315638 917.519035 0.0 0.0 0.0 1307.00 2517.0
Female 930.0 667.207527 924.726152 0.0 0.0 0.0 1298.25 2517.0
In [36]:
# Plot distribution of total relationship count by attrition flag
plt.figure(figsize=(10, 6), dpi=150)
sns.kdeplot(data=df_clean,
            x='total_relationship_count',
            hue='churned',
            fill=True,
            alpha=0.5,
            palette='cool')
plt.title('Distribution of Total Relationship Count by Attrition Flag')
plt.xlabel('Total Relationship Count')
plt.ylabel('Density')
plt.show()
In [37]:
churned_count_by_relationship = df_clean[df_clean['churned'] == 1].groupby(
    ['total_relationship_count'])['clientnum'].count()
churned_count_by_relationship.describe()
Out[37]:
count      6.000000
mean     271.166667
std       81.715156
min      196.000000
25%      225.500000
50%      230.000000
75%      317.750000
max      400.000000
Name: clientnum, dtype: float64
In [38]:
# Plot distribution of total transaction amount by attrition flag
plt.figure(figsize=(10, 6), dpi=150)
sns.kdeplot(data=df_clean,
            x='total_trans_amt',
            hue='churned',
            fill=True,
            alpha=0.5,
            palette='cool')
plt.title('Distribution of Total Transaction Amount by Attrition Flag')
plt.xlabel('Total Transaction Amount')
plt.ylabel('Density')
plt.show()
In [39]:
churned_trans_by_gender_age = df_clean[df_clean['churned'] == 1].groupby(
    ['gender'])['total_trans_amt'].describe()
In [40]:
pd.set_option('display.max_colwidth', 20)
pd.set_option('display.max_columns', 10)
churned_trans_by_gender_age
Out[40]:
count mean std min 25% 50% 75% max
gender
Male 697.0 3509.779053 2766.604342 530.0 1759.00 2270.0 4883.00 10583.0
Female 930.0 2784.183871 1834.421181 510.0 2013.25 2360.0 2645.75 10294.0
In [41]:
# Plot distribution of total count change by attrition flag
plt.figure(figsize=(10, 6),dpi=150)
sns.kdeplot(data=df_clean,
            x='total_ct_chng_q4_q1',
            hue='churned',
            fill=True,
            alpha=0.5,
            palette='cool')
plt.title('Distribution of Total Count Change by Attrition Flag')
plt.xlabel('Total Count Change Q4-Q1')
plt.ylabel('Density')
plt.show()
In [42]:
churned_total_ct_chng_q4_q1_by_gender = df_clean[
    df_clean['churned'] == 1].groupby(['gender'
                                       ])['total_ct_chng_q4_q1'].describe()
pd.set_option('display.max_colwidth', 20)
pd.set_option('display.max_columns', 10)
churned_total_ct_chng_q4_q1_by_gender
Out[42]:
count mean std min 25% 50% 75% max
gender
Male 697.0 0.579077 0.252603 0.0 0.4 0.552 0.735 2.500
Female 930.0 0.535881 0.203639 0.0 0.4 0.517 0.654 1.684

Client Activity¶

In [43]:
# Set style
sns.set_style('whitegrid')

# Create distribution plot of contacts count by attrition flag
plt.figure(figsize=(10, 6),dpi=150)
sns.kdeplot(data=df_clean,
            x='contacts_count_12_mon',
            hue='churned',
            fill=True,
            alpha=0.5,
            palette='cool')
plt.title(
    'Distribution of Contacts Count in the Last 12 Months by Attrition Flag')
plt.xlabel('Contacts Count in the Last 12 Months')
plt.ylabel('Density')
plt.show()
In [44]:
# Set style
sns.set_style('whitegrid')

# Create histogram of contacts count by attrition flag
plt.figure(figsize=(10, 6),dpi=150)
sns.histplot(data=df_clean,
            x='contacts_count_12_mon',
            hue='churned',
            multiple='stack',
            palette='cool')
plt.title('Distribution of Contacts Count in the Last 12 Months by Attrition Flag')
plt.xlabel('Contacts Count in the Last 12 Months')
plt.ylabel('Count')
plt.show()
In [45]:
# Plot distribution of months on book by attrition flag
plt.figure(figsize=(10, 6),dpi=150)
sns.kdeplot(data=df_clean,
            x='months_on_book',
            hue='churned',
            fill=True,
            alpha=0.5,
            palette='cool')
plt.title('Distribution of Months on Book by Attrition Flag')
plt.xlabel('Months on Book')
plt.ylabel('Density')
plt.show()
  • Customer retention: Since the mean number of months on the book is 36, we can assume that a significant proportion of the customers have been with the bank for a relatively long period of time. This could indicate that the bank has been successful in retaining its customers over a longer period of time.
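One way to probe the retention claim above is to compute the churn rate by tenure bucket, since the mean of a 0/1 churn flag within each bucket is the bucket's churn rate. A minimal sketch on synthetic stand-in data (the notebook would use `df_clean[['months_on_book', 'churned']]`):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_clean[['months_on_book', 'churned']];
# tenure spans 13-56 months, matching the describe() output above.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'months_on_book': rng.integers(13, 57, 300),
    'churned': rng.integers(0, 2, 300),
})

# Bucket tenure with pd.cut, then take the mean churn flag per bucket.
buckets = pd.cut(df['months_on_book'], bins=[12, 24, 36, 48, 56])
rate_by_tenure = df.groupby(buckets, observed=True)['churned'].mean()
print(rate_by_tenure.round(3))
```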
In [ ]: